Fine-tuned based on the Qwen/Qwen2.5-1.5B-Instruct model, using the TinyV reward system, which can provide more accurate reward signals in the post-training of efficient reinforcement learning (RL) and significantly improve RL efficiency and the performance of the final model.
Large Language Model
Transformers